-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: system tests: enable parallel tests #23989
CI: system tests: enable parallel tests #23989
Conversation
For the past two months we've been splitting system tests into two categories: those that CAN be run in parallel, and those that CANNOT. Much work has been done to replace hardcoded names (mycontainer, mypod) with safename(). Hundreds of test runs, in CI and on Ed's laptop, have proven this approach viable. make {local,remote}system now runs in two steps: first the serial ones, then the parallel ones. hack/bats will now recognize the 'ci:parallel' tag and add --jobs (nprocs). This requires some tweaking of leak_check, because there can be umpteen tests running (affecting image/container/pod/etc state) when any given test completes. Rules for enabling parallelization in tests: * use unique container/pod/volume/network names (safename) * do not run 'podman rm -a' or 'rmi -a' * never use the -l (--latest) option * do not run 'podman ps/images' and expect precise output Signed-off-by: Ed Santiago <[email protected]>
Workaround for containers#23292, where simultaneous 'pod create' commands will all start a podman-build of the pause image, but only one of them will be tagged, and the others will leak <none> images. Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
For tests run in parallel, show file number as |nnn| (vs [nnn]) Teach logformatter to distinguish the two, adding 'p' to anchors in parallel tests. Necessary because in this scheme we run bats twice, thus see 'ok 1' twice, and we want to differentiate them. Signed-off-by: Ed Santiago <[email protected]>
Add a few best-practices examples, and add a whole section describing the dos and donts of writing parallel-safe tests. Signed-off-by: Ed Santiago <[email protected]>
When running parallel, multiple tests could be trying to start the registry at once. Make this parallel-safe. Also, use a safer port range for the registry. Something outside of /proc/sys/net/ipv4/ip_local_port_range Sorry, I'm including a FIXME section that I haven't investigated deeply enough. Signed-off-by: Ed Santiago <[email protected]>
Need --layers=false in podman build, otherwise a buildah race can trigger "layer not known" failures: containers/buildah#5674 Signed-off-by: Ed Santiago <[email protected]>
Signed-off-by: Ed Santiago <[email protected]>
...for dealing with flakes in parallel mode Signed-off-by: Ed Santiago <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: edsantiago The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
LGTM on my end. Let's bite the bullet and get it enabled. |
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would have been great if I had gotten a chance to review this first before merging so quickly. Anyhow no major blocker so it is fine.
gce_instance: *standardvm | ||
timeout_in: 35m | ||
gce_instance: *fastvm | ||
timeout_in: 25m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slowest test in this PR was 23:23 so this looks very close to hitting this already, maybe it was as particular bad run but we should not set it to close IMO
I guess it is fine as you keep converting more files to run parallel so we speed up over the next days/weeks?
# | ||
# This is an example of what NOT to do when enabling parallel tests. | ||
# | ||
# bats test_tags=ci:parallel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should likely mention the possibility of a file_tag for the whole file? i.e. do not set it if the file already has it set and a author must check if the whole file is parallel
Also examples of what not to do are bad if they do not show what to do instead. This should be in the style of
do not do
run_podman --name mycontainer $IMAGE top
instead do
local cname="c-$(safename)"
run_podman --name $cname $IMAGE top
etc...
@@ -5,11 +5,10 @@ debug failures. | |||
Quick Start | |||
=========== | |||
|
|||
Look at [030-run.bats](030-run.bats) for a simple but packed example. | |||
Look at [000-TEMPLATE](000-TEMPLATE) for a simple starting point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no docs here in the Readme about parallel tests requirements and I assume many will not look into the template file. I think the parallel testing should have its own section here
I was expecting a week of back-and-forth in review. I tried to be diligent about cleaning up my test-only comments and code, but missed some. I will clean those up today. Thank you. (Documentation will be lower priority. I still have a few more tests to submit) |
Well, here goes. Let's try enabling parallel system tests.
There's still a ton of work left to do, but I think we're ready enough. The remaining work is in #23275, and I will continue submitting piecemeal PRs over time.
We're gonna see flakes. Sorry. The remaining flakes are really hard to track down. OTOH rerun time will be shorter.
Note to reviewers: the only way to handle this is by looking at individual commits.